
 chinese character representation


Glyce: Glyph-vectors for Chinese Character Representations

Neural Information Processing Systems

It is intuitive that NLP tasks for logographic languages like Chinese should benefit from the use of the glyph information in those languages. However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found. In this paper, we address this gap by presenting Glyce, the glyph-vectors for Chinese character representations. We make three major innovations: (1) We use historical Chinese scripts (e.g., bronzeware script, seal script, traditional Chinese, etc.) to enrich the pictographic evidence in characters; (2) We design CNN structures (called tianzege-CNN) tailored to Chinese character image processing; and (3) We use image classification as an auxiliary task in a multi-task learning setup to increase the model's ability to generalize. We show that glyph-based models are able to consistently outperform word/char ID-based models in a wide range of Chinese NLP tasks. When combined with BERT, we are able to set new state-of-the-art results for a variety of Chinese NLP tasks, including language modeling, tagging (NER, CWS, POS), sentence pair classification (BQ, LCQMC, XNLI, NLPCC-DBQA), single sentence classification tasks (ChnSentiCorp, the Fudan corpus, iFeng), dependency parsing, and semantic role labeling. For example, the proposed model achieves an F1 score of 81.6 on the OntoNotes dataset for NER, +1.5 over BERT; it achieves an almost perfect accuracy of 99.8% on the Fudan corpus for text classification.
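The abstract names image classification as an auxiliary task but does not state how the two objectives are combined. The following is a minimal sketch, assuming a simple convex combination of the main-task loss and the glyph-classification loss. The function names, the `aux_weight` parameter, and the weighting scheme itself are illustrative assumptions, not the authors' published formulation.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed stably in pure Python."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def multitask_loss(task_logits, task_label, glyph_logits, char_id, aux_weight):
    """Blend a main NLP task loss with an auxiliary glyph-classification loss.

    ASSUMPTION: aux_weight in [0, 1] is a hypothetical mixing coefficient;
    the source text does not specify how the two losses are weighted.
    """
    if not 0.0 <= aux_weight <= 1.0:
        raise ValueError("aux_weight must lie in [0, 1]")
    main = cross_entropy(task_logits, task_label)    # e.g. a tagging decision
    aux = cross_entropy(glyph_logits, char_id)       # predict the character ID from its glyph image
    return (1.0 - aux_weight) * main + aux_weight * aux


# Toy usage: 2 task classes, 4 candidate character IDs, equal weighting.
loss = multitask_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 1, aux_weight=0.5)
```

With `aux_weight=0.0` the objective reduces to the main-task loss alone, which makes it easy to compare a model trained with and without the auxiliary signal.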


Reviews: Glyce: Glyph-vectors for Chinese Character Representations

Neural Information Processing Systems

We are accepting the paper but ask the authors to carefully address the reviewers' comments and revise the paper. Please try to improve the clarity of the presentation (experimental setup, training procedure, etc.). Please also include qualitative evidence that the visual model is helpful for Chinese character embedding in addition to the quantitative results. This needs to go beyond the analysis and discussion in the rebuttal.


Reviews: Glyce: Glyph-vectors for Chinese Character Representations

Neural Information Processing Systems

This paper describes a method for leveraging sub-character information from Chinese characters, and reports small but reliable improvements on a large number of Chinese NLP tasks. The paper is strong in the results that it reports. The authors show that incorporating their "Glyce" embeddings improves results from BERT (which is SOTA on nearly all of the tasks), as well as from the strongest non-BERT models, for a wide variety of tasks. So it appears that the authors' methods have successfully allowed them to leverage some useful signal from the sub-character information, which seems a reasonably significant contribution for Chinese NLP. The main weakness of the paper is in the clarity of the methods.

